A specialized Winograd Conv2d op #971

Draft: wants to merge 15 commits into base: master
Conversation

@bssrdf (Contributor) commented Sep 29, 2024:

This PR adds a new conv2d op using the Winograd algorithm.

Currently ggml's conv2d operator uses im2col and GEMM. There have been efforts to speed up this process using other, faster algorithms. Winograd is one such method, used by many neural network libraries, e.g. cuDNN. For small kernels, e.g. 3x3, Winograd outperforms GEMM-based methods. However, an efficient implementation of Winograd on GPUs requires significant engineering effort. This PR's Winograd implementation is specialized in several ways (a minimal sketch of the underlying transform follows the list below):

  • It only supports 3x3 kernels
  • It only supports input channel counts that are multiples of 8
  • It only supports output filter counts that are multiples of 64
  • It only supports stride = 1, padding = 1, and dilation = 1
  • It only supports the CUDA backend
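
For context, here is a minimal sketch of the 1D Winograd F(2,3) transform that the 2D F(2x2, 3x3) algorithm is built from: it computes 2 outputs of a 3-tap filter from 4 inputs using 4 multiplications instead of 6. The function name and layout are illustrative, not this PR's code.

```c
// F(2,3): y = A^T [ (G g) .* (B^T d) ]
// d: 4 input values, g: 3 filter taps, y: 2 output values
static void winograd_f2_3(const float d[4], const float g[3], float y[2]) {
    // filter transform U = G g (in practice precomputed once per filter)
    const float u0 = g[0];
    const float u1 = 0.5f*(g[0] + g[1] + g[2]);
    const float u2 = 0.5f*(g[0] - g[1] + g[2]);
    const float u3 = g[2];
    // input transform V = B^T d
    const float v0 = d[0] - d[2];
    const float v1 = d[1] + d[2];
    const float v2 = d[2] - d[1];
    const float v3 = d[1] - d[3];
    // elementwise product, then output transform y = A^T m
    const float m0 = u0*v0, m1 = u1*v1, m2 = u2*v2, m3 = u3*v3;
    y[0] = m0 + m1 + m2;  // == g0*d0 + g1*d1 + g2*d2
    y[1] = m1 - m2 - m3;  // == g0*d1 + g1*d2 + g2*d3
}
```

The 2D variant applies the same transforms along both axes of 4x4 input tiles, which is where the 3x3-kernel, stride-1 restriction above comes from.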

Other features:

  • Fully fused, except for the kernel transform, which requires additional workspace (this can be shared with the weights)

It is mainly intended for applications such as stable-diffusion.cpp.

The code is based on the openCNN project, which is licensed under Apache-2.0.

Please review and let me know about any problems; I'll address them. Thanks.

@JohannesGaessler (Collaborator) left a comment:

> The code is based on openCNN project which uses Apache-2.0 license.

Did you get permission from the authors to re-license their code as MIT?

(Three inline review comments, one on src/ggml-cuda.cu and two on src/ggml-cuda/conv-winograd.cu, were marked outdated and resolved.)
typedef float(*pointFunction_t)(float *, int);

template<typename T>
__global__ void FX(const T *pInputs, float *pOutputs, int filt_k,
JohannesGaessler (Collaborator):

Suggested change:
```diff
-__global__ void FX(const T *pInputs, float *pOutputs, int filt_k,
+__global__ void FX(const T * __restrict__ pInputs, float * __restrict__ pOutputs, int filt_k,
```

On Pascal this can be a 5x speedup.


}

__device__ __forceinline__ void prefetch_filter_tile(const float *pInputs, float *tiles, int filt_k){
JohannesGaessler (Collaborator):

The compiler will rearrange these instructions as it sees fit, so in effect there will not be any actual prefetching. For that you need to use asynchronous memcpys (Ampere or newer).
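
For illustration, a minimal sketch (names are mine, not the PR's code) of the kind of asynchronous copy meant here, using cooperative_groups::memcpy_async, which lowers to cp.async on Ampere or newer and actually overlaps the global-to-shared copy with independent work:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// assumes blockDim.x == 256
__global__ void kernel_with_prefetch(const float * __restrict__ src, float * __restrict__ dst) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // issue the copy; on Ampere+ this becomes cp.async and bypasses registers
    cg::memcpy_async(block, tile, src + blockIdx.x*256, sizeof(float)*256);

    // ... independent computation can overlap with the copy here ...

    cg::wait(block); // complete the copy and synchronize the block
    dst[blockIdx.x*256 + threadIdx.x] = tile[threadIdx.x] * 2.0f;
}
```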

bssrdf (Contributor, Author) replied:

I am not sure about this one; it is done this way in openCNN.

Comment on lines +11 to +13
__constant__ int access_f_s[2][32];
__constant__ int access_s[2][32];
__constant__ int tileid[2][32];
JohannesGaessler (Collaborator):

What happens in the case of multiple GPUs? Is the constant memory duplicated across GPUs?

bssrdf (Contributor, Author) replied:

I am pretty ignorant about multi-GPU; I guess they will be duplicated, but I don't have a setup to test. Plus, I think this kernel only works on a single GPU.
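
For what it's worth, __constant__ symbols are instantiated once per device, and cudaMemcpyToSymbol only writes the copy on the currently active device. A sketch of what per-device initialization could look like (illustrative, not the PR's code; CUDA_CHECK is ggml-cuda's error-checking macro):

```cuda
static void upload_winograd_tables(const int host_access_f_s[2][32]) {
    int n_devices = 0;
    CUDA_CHECK(cudaGetDeviceCount(&n_devices));
    for (int id = 0; id < n_devices; ++id) {
        CUDA_CHECK(cudaSetDevice(id));
        // this only initializes the symbol on device `id`
        CUDA_CHECK(cudaMemcpyToSymbol(access_f_s, host_access_f_s, sizeof(int)*2*32));
    }
}
```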

Comment on lines +15199 to +15213
static void ggml_compute_forward_winograd_stage0(
const struct ggml_compute_params * params,
struct ggml_tensor * dst) {

GGML_ASSERT(false && " CPU backend not implemented!");
return;
}

static void ggml_compute_forward_winograd_stage1(
const struct ggml_compute_params * params,
struct ggml_tensor * dst) {

GGML_ASSERT(false && " CPU backend not implemented!");
return;
}
JohannesGaessler (Collaborator):

If at all possible, a CPU implementation should always be provided, since it serves both as a fallback and as a reference implementation to test other backends against.

bssrdf (Contributor, Author) replied:

A CPU backend should be done, but I am not sure of its benefit compared to the current im2col+GEMM version.
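
For reference, even a naive direct convolution would suffice as the CPU fallback, since its purpose is correctness testing rather than speed. A sketch under this PR's constraints (3x3 kernel, stride = 1, padding = 1, dilation = 1); the layout and names are assumptions, not the PR's code:

```c
// src: [C][H][W], ker: [K][C][3][3], dst: [K][H][W], zero padding of 1
static void conv2d_3x3_ref(const float * src, const float * ker, float * dst,
                           int C, int K, int H, int W) {
    for (int k = 0; k < K; ++k) {
        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x) {
                float acc = 0.0f;
                for (int c = 0; c < C; ++c) {
                    for (int dy = -1; dy <= 1; ++dy) {
                        for (int dx = -1; dx <= 1; ++dx) {
                            const int sy = y + dy, sx = x + dx;
                            if (sy < 0 || sy >= H || sx < 0 || sx >= W) {
                                continue; // zero padding
                            }
                            acc += src[(c*H + sy)*W + sx]
                                 * ker[((k*C + c)*3 + dy+1)*3 + dx+1];
                        }
                    }
                }
                dst[(k*H + y)*W + x] = acc;
            }
        }
    }
}
```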

Comment on lines +7181 to +7185
bool is_node = false;

if (a->grad) {
is_node = true;
}
JohannesGaessler (Collaborator):

If #966 is merged first this will need to be removed (should be very straightforward).

bssrdf (Contributor, Author) replied:

Looking forward to it.

@JohannesGaessler (Collaborator) commented:

Have you done any tests regarding performance? This code does not use tensor cores at all so intuitively I would expect it to be slower than im2col + GEMM with tensor cores.

@bssrdf marked this pull request as draft on September 29, 2024, 17:40.
@bssrdf (Contributor, Author) commented Sep 29, 2024:

> Have you done any tests regarding performance? This code does not use tensor cores at all so intuitively I would expect it to be slower than im2col + GEMM with tensor cores.

Thank you for your review, @JohannesGaessler. I learned a lot from your PRs and comments.

First, I have asked openCNN's author about the license issue.

As for performance, I have only tested it in SD.cpp, since it was developed for that use case. To my surprise, it is not faster than im2col+GEMM with tensor cores (my GPU has them, so I assume they are being used), but it is definitely not slower either. It reduces the memory used by the VAE quite a lot while increasing the UNET param buffer. There is room to further improve its performance, as I see several places that are not working optimally.

I'll add test cases in test-backend-ops to measure performance more rigorously.

I addressed your other comments above.
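
As a rough sketch of the comparison such a test would make, using the public ggml API (shapes are illustrative and chosen to satisfy the PR's constraints; this is not the PR's test code):

```cpp
#include "ggml.h"

// builds the existing im2col+GEMM reference path; the timed Winograd variant
// would replace the ggml_conv_2d call with the new op
static struct ggml_tensor * build_conv_graph(struct ggml_context * ctx) {
    // ggml convention: kernel ne = [KW, KH, Cin, Cout], input ne = [W, H, Cin, N]
    struct ggml_tensor * kernel = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, 3, 3, 128, 256);
    struct ggml_tensor * input  = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 64, 64, 128, 1);
    return ggml_conv_2d(ctx, kernel, input,
                        /*s0=*/1, /*s1=*/1, /*p0=*/1, /*p1=*/1, /*d0=*/1, /*d1=*/1);
}
```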

@slaren (Collaborator) commented Sep 29, 2024:

Have you tried NPP? It is a library bundled with the CUDA toolkit that has all kinds of kernels for image processing. I don't think this can be merged unless the license situation is resolved.

@JohannesGaessler (Collaborator) commented:

Generally speaking, my stance regarding this PR would be as follows: I think it's good to have convolution operations instead of having to rely on IM2COL. At the same time, I want a codebase that is easy to maintain; a central factor for me is that there needs to be some benefit to adding code that offsets the increase in maintenance effort. Quite honestly, I think the starting point from OpenCNN is not very good; I would be rather hesitant to add it, since the use cases are limited and I think none of the devs on this project would have a very good understanding of how the code works.

And as slaren said, the licensing issue must be resolved or this is a total non-starter anyway.

> Have you tried NPP? It is a library bundled with the CUDA toolkit that has all kinds of kernels for image processing.

From what I can tell, there is convolution support.

@JohannesGaessler (Collaborator) commented Sep 29, 2024:

I'm already tired so maybe I'm just misreading the docs, but I get the impression that NPP convolutions only support 1-4 input channels.

@bssrdf (Contributor, Author) commented Sep 29, 2024:

Thanks to both of you for reviewing. I am not familiar with licensing issues; in case this one is not resolvable, I'll ditch this PR.
For now I am putting it in draft mode, hoping to make it work in more general settings so that it can truly serve as an alternative to the im2col approach.

@JohannesGaessler (Collaborator) commented:

Also one important question that I forgot to ask: are you going to be available long-term to maintain this code?

@bssrdf (Contributor, Author) commented Sep 30, 2024:

> Also one important question that I forgot to ask: are you going to be available long-term to maintain this code?

If this PR makes it into main, I intend to maintain it long-term and improve its performance.
